
34 research outputs found

Deploying Jupyter Notebooks at scale on XSEDE resources for Science Gateways and workshops

Author: Jette Morris A.
Kluyver Thomas
Weil Sage A
Wilkins-Diehr Nancy
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 25/07/2018

Jupyter Notebooks have become a mainstream tool for interactive computing in every field of science. Jupyter Notebooks are suitable as companion applications for Science Gateways, providing more flexibility and post-processing capability to the users. Moreover, they are often used in training events and workshops to provide immediate access to a pre-configured interactive computing environment. The Jupyter team released the JupyterHub web application to provide a platform where multiple users can log in and access a Jupyter Notebook environment. When the number of users and memory requirements are low, it is easy to set up JupyterHub on a single server. However, setup becomes more complicated when we need to serve Jupyter Notebooks at scale to tens or hundreds of users. In this paper we present three strategies for deploying JupyterHub at scale on XSEDE resources. All options share the deployment of JupyterHub on a virtual machine on XSEDE Jetstream. In the first scenario, JupyterHub connects to a supercomputer, launches a single-node job on behalf of each user, and proxies the Notebook from the computing node back to the user's browser. In the second scenario, implemented in the context of an XSEDE consultation for the IRIS consortium for Seismology, we deploy Docker in Swarm mode to coordinate many XSEDE Jetstream virtual machines to provide Notebooks with persistent storage and quota. In the last scenario we install the Kubernetes container orchestration framework on Jetstream to provide a fault-tolerant JupyterHub deployment with a distributed filesystem and the capability to scale to thousands of users. In the conclusion section we provide a link to step-by-step tutorials complete with all the necessary commands and configuration files to replicate these deployments.

Comment: 7 pages, 3 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22-26, 2018, Pittsburgh, PA, US

arXiv.org e-Print Archive

Crossref

From Bare Metal to Virtual: Lessons Learned when a Supercomputing Institute Deploys its First Cloud

Author: Bell T.
Merkel Dirk
Amazon Web Services
Weil Sage A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 23/07/2018

As the primary provider of research computing services at the University of Minnesota, the Minnesota Supercomputing Institute (MSI) has long been responsible for serving the needs of a user base numbering in the thousands. In recent years, MSI, like many other HPC centers, has observed a growing need for self-service, on-demand, data-intensive research, as well as the emergence of many new controlled-access datasets for research purposes. In light of this, MSI constructed a new on-premise cloud service, named Stratus, which is architected from the ground up to easily satisfy data-use agreements and fill four gaps left by traditional HPC. The resulting OpenStack cloud, constructed from HPC-specific compute nodes and backed by Ceph storage, is designed to fully comply with controls set forth by the NIH Genomic Data Sharing Policy. Herein, we present twelve lessons learned during the ambitious sprint to take Stratus from inception into production in less than 18 months. Important, and often overlooked, components of this timeline included the development of new leadership roles, staff and user training, and user support documentation. Along the way, the lessons learned extended well beyond the technical challenges often associated with acquiring, configuring, and maintaining large-scale systems.

Comment: 8 pages, 5 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22-26, 2018, Pittsburgh, PA, US

arXiv.org e-Print Archive

Crossref

Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

Author: Bisong Ekaba
Miao Zhuqi
Scheufele Elisabeth
Weil Sage A
Winn Peter A
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/07/2020

The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data), together with the difficulty of reproducing scientific research, remains a major challenge. This is especially true for health sciences research, where the systems must be i) easy to use with the flexibility to manipulate data at the most granular level, ii) agnostic of programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on such big data systems for scientific research in health sciences and identify the gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.

Comment: This paper is accepted in ACM-BCB 202

arXiv.org e-Print Archive

Crossref

Targeted stem cells expressing TRAIL as a therapy for lung Cancer TACTICAL: a phase I/II trial

Author: Bain O
Champion K
Davies A
Day A
Edwards A
Forster M
Fullen D
Janes SM
Kalber T
Kolluri K
Lowdell M
Lythgoe M
Patrick S
Popova B
Rego RVTP
Sage E
Santilli G
Thakrar R
Thrasher A
Weil B
Publication venue: ELSEVIER IRELAND LTD
Publication date: 01/01/2018

UCL Discovery

A survey and classification of software-defined storage systems

Author: Alysson Bessani
Angel Sebastian
Anwar Ali
Belaramani Nalini M.
Belay Adam
Carl
Cully Brendan
Frank
Ghodsi Ali
Gracia-Tinedo Raúl
Gulati Ajay
Red Hat
Hsu Chin-Jung
Hunt Patrick
José Pereira
João Paulo
Kim Hyeong-Jun
Klimovic Ana
Koponen Teemu
Li Ning
Lumb Christopher R.
Mace Jonathan
Mesnier Michael
Murugan Muthukumar
Ongaro Diego
Peter Simon
Qian Yingjin
Raghavan Ajaykrishna
Ricardo Macedo
Riedel Erik
Schroeder Bianca
Schwan Philip
Seshadri Sudharsan
Sevilla Michael A.
Shan Yizhou
Shue David
Soheil
Song Huaiming
Stefanovici Ioan
Weil Sage A.
Wires Jake
Yang Bin
Yang Suli
Zhang Xuechen
Zhu Timothy
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020

The exponential growth of digital information is imposing increasing scale and efficiency demands on modern storage infrastructures. As infrastructure complexity increases, so does the difficulty in ensuring quality of service, maintainability, and resource fairness, raising unprecedented performance, scalability, and programmability challenges. Software-Defined Storage (SDS) addresses these challenges by cleanly disentangling control and data flows, easing management, and improving control functionality of conventional storage systems. Despite its momentum in the research community, many aspects of the paradigm are still unclear, undefined, and unexplored, leading to misunderstandings that hamper the research and development of novel SDS technologies. In this article, we present an in-depth study of SDS systems, providing a thorough description and categorization of each plane of functionality. Further, we propose a taxonomy and classification of existing SDS solutions according to different criteria. Finally, we provide key insights about the paradigm and discuss potential future research directions for the field.

This work was financed by the Portuguese funding agency FCT (Fundação para a Ciência e a Tecnologia) through national funds, the PhD grant SFRH/BD/146059/2019, the project ThreatAdapt (FCT-FNR/0002/2018), the LASIGE Research Unit (UIDB/00408/2020), and cofunded by the FEDER, where applicable.

Universidade do Minho: RepositoriUM

Crossref

Abstract

Author: Sage A. Weil
Scott A. Brandt

In petabyte-scale distributed file systems that decouple read and write from metadata operations, behavior of the metadata server cluster will be critical to overall system performance. We examine aspects of the workload that make it difficult to distribute effectively, and present a few potential strategies to demonstrate the issues involved. Finally, we describe the advantages of intelligent metadata management and a simulation environment we have developed to validate design possibilities.

CiteSeerX

Abstract

Author: Kristal T. Pollack
Sage A. Weil

In petabyte-scale distributed file systems that decouple read and write from metadata operations, behavior of the metadata server cluster will be critical to overall system performance and scalability. We present a dynamic subtree partitioning and adaptive metadata management system designed to efficiently manage hierarchical metadata workloads that evolve over time. We examine the relative merits of our approach in the context of traditional workload partitioning strategies, and demonstrate the performance, scalability and adaptability advantages in a simulation environment.

CiteSeerX